Results 1 - 16 of 16
1.
Lrec 2022: Thirteen International Conference on Language Resources and Evaluation ; : 3093-3102, 2022.
Article in English | Web of Science | ID: covidwho-2310924

ABSTRACT

Specialist high-quality information is typically first available in English, and it is written in language that most readers may find difficult to understand. While Machine Translation technologies help mitigate the first issue, the translated content will most likely still contain complex language. In order to investigate and address both problems simultaneously, we introduce Simple TICO-19, a new language resource containing manual simplifications of the English and Spanish portions of the TICO-19 corpus for Machine Translation of COVID-19 literature. We provide an in-depth description of the annotation process, which entailed designing an annotation manual and employing four annotators (two native English speakers and two native Spanish speakers) who simplified over 6,000 sentences from the English and Spanish portions of the TICO-19 corpus. We report several statistics on the new dataset, focusing on analysing the improvements in readability from the original texts to their simplified versions. In addition, we propose baseline methodologies for automatically generating the simplifications, translations and joint translation and simplifications contained in our dataset.

2.
Lrec 2022: Thirteen International Conference on Language Resources and Evaluation ; : 1068-1072, 2022.
Article in English | Web of Science | ID: covidwho-2310689

ABSTRACT

This paper presents a collection of parallel corpora generated by exploiting the COVID-19 related dataset of metadata created with the Europe Media Monitor (EMM) / Medical Information System (MediSys) processing chain of news articles. We describe how we constructed comparable monolingual corpora of news articles related to the current pandemic and used them to mine about 11.2 million segment alignments in 26 EN-X language pairs, covering most official EU languages plus Albanian, Arabic, Icelandic, Macedonian, and Norwegian. Subsets of this collection have been used in shared tasks (e.g. Multilingual Semantic Search, Machine Translation) aimed at accelerating the creation of resources and tools needed to facilitate access to information in the COVID-19 emergency situation.

3.
Psychological Test and Assessment Modeling ; 65(1):55-75, 2023.
Article in English | ProQuest Central | ID: covidwho-2306670

ABSTRACT

Keywords: Automated distractor generation, automated item generation, natural language processing, deep learning language models, prompt-based learning

Language testing programs, like many other educational and psychological testing programs, face increasing demands for flexible test administrations. Since the COVID-19 pandemic, many language proficiency tests have been offered for at-home administration with more available testing dates. [...] von Davier (2018) trained a long short-term memory (LSTM)-based recurrent neural network model and Hommel et al. [...] transformer-based models achieved state-of-the-art performance on a wide range of NLP benchmark tasks, such as the General Language Understanding Evaluation (GLUE; Wang et al., 2019), the Stanford Question Answering Dataset (SQuAD; Rajpurkar et al., 2016), and the Situations with Adversarial Generations (SWAG; Zellers et al., 2018). A typical fine-tuning process consumes a large number of examples (oftentimes several tens of thousands), yet it is rare for a testing program to have such a large item pool. [...] we designed language prompts for distractors and leveraged the prompts in fine-tuning to address this small sample challenge.

4.
Computacion Y Sistemas ; 26(4):1669-1687, 2022.
Article in English | Web of Science | ID: covidwho-2226242

ABSTRACT

Machine translation deals with automatic translation from one natural language to another. Neural machine translation is a widely accepted technique of the corpus-based machine translation approach. However, an adequate amount of training data is required, and a domain-wise parallel corpus covering various domains is needed to improve translation performance. In this work, a domain-specific parallel corpus is prepared that covers the Agriculture, Government Office, Judiciary, Social Media, Tourism, COVID-19, Sports, and Literature domains for low-resource English-Assamese pair translation. Moreover, we have tackled data scarcity and word-order divergence problems via data augmentation and a prior alignment concept. Also, we have contributed an Assamese pretrained LM, Assamese word embeddings built from Assamese monolingual data, and a bilingual dictionary-based post-processing step to enhance transformer-based neural machine translation. We have achieved state-of-the-art results for both forward (English-to-Assamese) and backward (Assamese-to-English) directions of translation.

5.
10th Workshop on the Representation and Processing of Sign Languages: Multilingual Sign Language Resources, sign-lang 2022 ; : 139-143, 2022.
Article in English | Scopus | ID: covidwho-2207840

ABSTRACT

In this paper, we examine the linguistic phenomenon known as 'depiction', which relates to the ability to visually represent semantic components (Dudis, 2004). While some elements of this have been described for Irish Sign Language, with particular attention to the 'productive lexicon' (Leeson and Grehan, 2004; Leeson and Saeed, 2012; Matthews, 1996; O'Baoill and Matthews, 2000), here, we take the analysis further, drawing on what we have learned from cognitive linguistics over the past decade. Drawing on several recently developed domain-specific glossaries (e.g., Science, Technology, Engineering, Math (STEM), Covid-19, political domain, Sexual, Domestic and Gender Based Violence (SDGBV)-related vocabulary) we present ongoing analysis indicating that a deliberate focus on iconicity, in particular, elements of depiction, appears to be a primary driver. We also outline some potential implications from Deaf-led glossary development work in the context of Machine Translation goals, for example, for work in progress on the Horizon 2020 funded SignON project. © European Language Resources Association (ELRA), licensed under CC-BY-NC 4.0.

6.
13th International Conference on Language Resources and Evaluation Conference, LREC 2022 ; : 6719-6727, 2022.
Article in English | Scopus | ID: covidwho-2170227

ABSTRACT

Previous research on adapting a general neural machine translation (NMT) model to a specific domain usually neglects the diversity in translation within the same domain, which is a core problem for domain adaptation in real-world scenarios. One representative of such challenging scenarios is deploying a translation system for a conference with a specific topic, e.g., global warming or coronavirus, where there are usually extremely few resources due to the limited schedule. To motivate wider investigation in such a scenario, we present a real-world fine-grained domain adaptation task in machine translation (FGraDA). The FGraDA dataset consists of Chinese-English translation tasks for four sub-domains of information technology: autonomous vehicles, AI education, real-time networks, and smart phones. Each sub-domain is equipped with a development set and test set for evaluation purposes. To be closer to reality, FGraDA does not employ any in-domain bilingual training data but provides bilingual dictionaries and a wiki knowledge base, which can be obtained more easily within a short time. We benchmark the fine-grained domain adaptation task and present in-depth analyses showing that challenging problems remain in further improving performance with heterogeneous resources. © European Language Resources Association (ELRA), licensed under CC-BY-NC-4.0.

7.
New Voices in Translation Studies ; - (26):25-54, 2022.
Article in English | Scopus | ID: covidwho-2058137

ABSTRACT

COVID-19 is a global health crisis which has necessitated quick dissemination of reliable information to people in a language they understand. Although translators can play a significant role in crisis situations (Al-Shehari 2019), the enormous volume of COVID-19 information online to be translated may be too demanding for human translators. Therefore, translation technologies and resources including machine translation (MT), computer-assisted translation (CAT), and translation memory (TM) may assist in responding to this challenge. This study examines Jordanian translators’ views on utilising translation technologies in rendering COVID-19 material into Arabic during the pandemic. The quantitative five-point Likert questionnaire was completed by 106 Jordanian translators. The findings show resistance to using MT by Jordanian translators, problems with translating COVID-19-related terms into Arabic, and a need to compile a unified glossary of COVID-19-related terms that could be used across the Arab world. © 2022 International Association of Translation and Intercultural Studies. All rights reserved.

8.
6th IEEE International Conference on Cybernetics and Computational Intelligence, CyberneticsCom 2022 ; : 376-380, 2022.
Article in English | Scopus | ID: covidwho-2051963

ABSTRACT

Vietnam has achieved impressive economic growth in the last two decades and has become a country worth investing in within the region. Consequently, the need to understand foreign investors from different countries (South Korea in particular) is an essential issue. Building an automatic machine translation system with high precision is therefore a necessary solution, especially during the COVID-19 pandemic, when keeping distance is the best way to avoid spreading the virus. This research presents some experimental results on the TED Talks 2020 dataset for the Korean-Vietnamese and Vietnamese-Korean machine translation tasks, with the purpose of providing an overview of the dataset and a deep learning machine translation model for the problem. © 2022 IEEE.

9.
23rd Annual Conference of the European Association for Machine Translation, EAMT 2022 ; : 287-288, 2022.
Article in English | Scopus | ID: covidwho-2044862

ABSTRACT

This project investigates the capabilities of machine translation (MT) models for generating translations at varying levels of readability, focusing on texts about COVID-19. Funded by the European Association for Machine Translation and by the Centre for Advanced Computational Sciences at Manchester Metropolitan University, we collected manual simplifications for English and Spanish texts in the TICO-19 dataset, and assessed the performance of neural MT models in this new benchmark. Future work will implement models that jointly translate and simplify, and develop suitable evaluation metrics. © 2022 The authors.

10.
JMIR Ment Health ; 9(9): e39556, 2022 Sep 06.
Article in English | MEDLINE | ID: covidwho-2022416

ABSTRACT

BACKGROUND: Patients with limited English proficiency frequently receive substandard health care. Asynchronous telepsychiatry (ATP) has been established as a clinically valid method for psychiatric assessments. The addition of automated speech recognition (ASR) and automated machine translation (AMT) technologies to asynchronous telepsychiatry may be a viable artificial intelligence (AI)-language interpretation option. OBJECTIVE: This project measures the frequency and accuracy of the translation of figurative language devices (FLDs) and patient word count per minute, in a subset of psychiatric interviews from a larger trial, as an approximation to patient speech complexity and quantity in clinical encounters that require interpretation. METHODS: A total of 6 patients were selected from the original trial, where they had undergone 2 assessments, once by an English-speaking psychiatrist through a Spanish-speaking human interpreter and once in Spanish by a trained mental health interviewer-researcher with AI interpretation. Of the 6 selected patients, 3 (50%) were interviewed via videoconferencing because of the COVID-19 pandemic. Interview transcripts were created by automated speech recognition with manual corrections for transcriptional accuracy and assessment for translational accuracy of FLDs. RESULTS: AI-interpreted interviews were found to have a significant increase in the use of FLDs and patient word count per minute. Both human and AI-interpreted FLDs were frequently translated inaccurately; however, FLD translation may be more accurate on videoconferencing. CONCLUSIONS: AI interpretation is currently not sufficiently accurate for use in clinical settings. However, this study suggests that alternatives to human interpretation are needed to circumvent modifications to patients' speech. While AI interpretation technologies are being further developed, using videoconferencing for human interpreting may be more accurate than in-person interpreting.
TRIAL REGISTRATION: ClinicalTrials.gov NCT03538860; https://clinicaltrials.gov/ct2/show/NCT03538860.

11.
Natural Language Engineering ; : 1-23, 2022.
Article in English | Web of Science | ID: covidwho-2016483

ABSTRACT

Cyberbullying is the wilful and repeated infliction of harm on an individual using the Internet and digital technologies. Similar to face-to-face bullying, cyberbullying can be captured formally using the Routine Activities Model (RAM) whereby the potential victim and bully are brought into proximity of one another via the interaction on online social networking (OSN) platforms. Although the impact of the COVID-19 (SARS-CoV-2) restrictions on the online presence of minors has yet to be fully grasped, studies have reported that 44% of pre-adolescents have encountered more cyberbullying incidents during the COVID-19 lockdown. Transparency reports shared by OSN companies indicate an increase in take-downs of cyberbullying-related comments, posts or content by artificially intelligent moderation tools. However, in order to efficiently and effectively detect or identify whether a social media post or comment qualifies as cyberbullying, a number of factors based on the RAM must be taken into account, including the identification of cyberbullying roles and forms. This demands the acquisition of large amounts of fine-grained annotated data, which is costly and ethically challenging to produce. In addition, where fine-grained datasets do exist, they may be unavailable in the target language. Manual translation is costly; however, state-of-the-art neural machine translation offers a workaround. This study presents a first-of-its-kind experiment in leveraging machine translation to automatically translate a unique pre-adolescent cyberbullying gold standard dataset in Italian with fine-grained annotations into English for training and testing a native binary classifier for pre-adolescent cyberbullying.
In addition to contributing high-quality English reference translation of the source gold standard, our experiments indicate that the performance of our target binary classifier when trained on machine-translated English output is on par with the source (Italian) classifier.

12.
29th Iranian Conference on Electrical Engineering (ICEE) ; : 540-544, 2021.
Article in English | Web of Science | ID: covidwho-1853443

ABSTRACT

Fake news detection has become an emerging and critical topic of research in recent years. One of the major complications of fake news detection lies in the fact that news in social networks is multilingual, and therefore developing methods for each and every language in the world is impossible, especially for low-resource languages like Persian. In an effort to solve this problem, researchers use machine translation to unify the data and develop a method for the unified data. In this paper, we aim to explore the impacts of machine translation on fake news detection. For this purpose, we extracted and labeled a dataset of Persian Tweets from Twitter on the subject of COVID-19 and developed a method for detecting fake news on the extracted Tweets based on the SVM classifier. We then machine translated the data and applied our proposed method to it. The result for binary-class (only fake and legitimate) fake news detection was 87%, and for multiclass (satire, misinformation, neutral and legitimate) fake news detection it was 62%. Our findings demonstrate that machine translation has a 4% negative impact on binary classification accuracy and a 23% negative impact on multiclass classification.

13.
International Journal of Advanced Computer Science and Applications ; 12(11), 2021.
Article in English | ProQuest Central | ID: covidwho-1835998

ABSTRACT

Originating and striking from anywhere, cyber-attacks have become ever more sophisticated in our modern society, and users are forced to adopt increasingly good and vigilant practices to protect themselves from them. Among these, ransomware remains a major cyber-attack whose major threat to end users (disrupted operations, restricted files, scrambled sensitive data, financial demands, etc.) does not particularly lie in number but in severity. In this study we explore the possibility of real-time detection of the ransomware source through a linguistic analysis that examines machine translation relative to the Levenshtein Distance and may thereby provide important indications as to the attacker’s language of origin. Specifically, the aim of our research is to advance a metric to assist in determining whether an external ransom text is an indicator of either a human- or a machine-generated cyber-attack. Our proposed method works its argument on a set of Eastern European languages but is applicable to a large(r) range of languages and/or probabilistic patterns, being characterized by usage of limited resources and scalability properties.
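The abstract above names the Levenshtein Distance but does not say how its metric uses it. As a minimal sketch of the distance itself (not the authors' method), the following computes the standard edit distance between two strings; the length-normalised variant and the function names are assumptions added for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # Keep the shorter string in b so the working rows stay small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Hypothetical helper: edit distance scaled to [0, 1] by the longer length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

One plausible use, again an assumption, would be to compare a ransom note against a machine translation of a candidate source text and treat a small normalised distance as a hint that the note was machine-generated.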

14.
6th Conference on Machine Translation, WMT 2021 ; : 821-827, 2021.
Article in English | Scopus | ID: covidwho-1781813

ABSTRACT

The majority of language domains require prudent use of terminology to ensure clarity and adequacy of information conveyed. While the correct use of terminology for some languages and domains can be achieved by adapting general-purpose MT systems on large volumes of in-domain parallel data, such quantities of domain-specific data are seldom available for less-resourced languages and niche domains. Furthermore, as exemplified by COVID-19 recently, no domain-specific parallel data is readily available for emerging domains. However, the gravity of this recent calamity created a high demand for reliable translation of critical information regarding the pandemic and infection prevention. This work is part of the WMT2021 Shared Task: Machine Translation using Terminologies, where we describe Tilde MT systems that are capable of dynamic terminology integration at the time of translation. Our systems achieve up to 94% COVID-19 term use accuracy on the test set of the EN-FR language pair without having access to any form of in-domain information during system training. We conclude our work with a broader discussion considering the Shared Task itself and terminology translation in MT. © 2021 Association for Computational Linguistics

15.
17th International Scientific Conference on eLearning and Software for Education, eLSE 2021 ; : 37-43, 2021.
Article in English | Scopus | ID: covidwho-1786319

ABSTRACT

Both the emergence of the pandemic and the lack of knowledge and/or time needed to translate texts related to this topic brought about an increased interest in analysing Neural Machine Translation (NMT) performance. In this study, we seek to fulfil the following purposes: 1. to examine types of translation errors from English into Romanian; 2. to establish the causes of those errors or find possible explanations for them; 3. to evaluate the quality and accuracy of Google Translate when translating health information from English to Romanian. The study provides theoretical insights by explaining operational concepts, such as error, Machine Translation terminology, the construction of Machine Translation lexicons, and contrastive aspects of the two languages involved in the process. Lexical contrastive analysis, a branch of comparative linguistics, emphasizes the interferences of the languages in contact in the process of translation, the main objective being to optimize the operation of the machine translation system. Some areas of knowledge, such as domain terminology, verb or noun patterns, multiword expressions, general vocabulary and domain-specific vocabulary are implemented in order to detect the lexical errors and to improve the lexical knowledge of Romanian in the Machine Translation database. The data collection comprises the coronavirus vaccine prospects and texts collected from official websites and translated using Google Translate and Google Languages Tools. Samples of errors, potential translation issues, and poor MT performance are manually examined in order to conduct a linguistic error analysis. Once the data is collected, the errors will be identified, classified, and described, and the machine translation output will finally be analysed through a descriptive methodology. From the data analysed, there are more than 50 lexical and semantic errors that are approached through descriptive methodology.
By examining types of errors in translation from English into Romanian and analysing the potential causes of errors, the results will be used to illustrate the quality and accuracy of Google Translate when translating public health information from English into Romanian, and to observe how much the message is affected by the error, in order to sharpen linguistic awareness. The results of the study can ultimately help improve the quality of NMT in terms of better lexical selection and provide input towards a more adequate translation into Romanian by Google Machine Translation. The proposed research will focus on the identification and typology of translation errors. In order to fulfil the objectives and to prepare the analysis of the lexical and semantic errors, the English texts will be paired with the equivalent Romanian version given by Google Translate. This work presents a first step towards automatic analysis of machine translation output. The current research, which analyses the raw English-Romanian translations from Google Translate about coronavirus, will show the degree of intelligibility and the errors that could lead to misinformation or ambiguous contexts. © 2021, National Defence University - Carol I Printing House. All rights reserved.

16.
5th Conference on Machine Translation, WMT 2020 ; : 875-880, 2021.
Article in English | Scopus | ID: covidwho-1668616

ABSTRACT

In this paper we describe the systems developed at Ixa for our participation in the WMT20 Biomedical shared task in three language pairs: en-eu, en-es and es-en. When defining our approach, we have put the focus on making efficient use of corpora recently compiled for training Machine Translation (MT) systems to translate Covid-19 related text, as well as reusing previously compiled corpora and developed systems for the biomedical or clinical domain. Regarding the techniques used, we build on the findings from our previous works on translating clinical texts into Basque, making use of clinical terminology for adapting the MT systems to the clinical domain. However, after manually inspecting some of the outputs generated by our systems, for most of the submissions we end up using the system trained only with the basic corpus, since the systems including the clinical terminologies generated outputs shorter in length than the corresponding references. Thus, we present simple baselines for translating abstracts between English and Spanish (en/es), while for translating abstracts and terms from English into Basque (en-eu), we concatenate the best en-es system for each kind of text with our es-eu system. We present automatic evaluation results in terms of BLEU scores, and analyse the effect of including clinical terminology on the average sentence length of the generated outputs. Following the recent recommendations for a responsible use of GPUs for NLP research, we include an estimation of the generated CO2 emissions, based on the power consumed for training the MT systems. © 2020 Association for Computational Linguistics
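The abstract above mentions estimating CO2 emissions from the power consumed during training, without giving the formula. A common back-of-the-envelope approach, sketched here as an assumption and not as the authors' calculation, multiplies the energy used by the carbon intensity of the electricity grid. The function name, the optional data-centre overhead factor (PUE) and the figures in the usage note are hypothetical.

```python
def estimate_co2_kg(avg_power_watts: float,
                    hours: float,
                    carbon_intensity_kg_per_kwh: float,
                    pue: float = 1.0) -> float:
    """Rough CO2 estimate for a training run.

    avg_power_watts: average power draw of the hardware (assumed measured).
    hours: wall-clock training time.
    carbon_intensity_kg_per_kwh: kg of CO2 emitted per kWh of electricity
        on the local grid (assumed known).
    pue: power usage effectiveness, an overhead multiplier for cooling
        and similar costs; 1.0 means no overhead.
    """
    energy_kwh = avg_power_watts / 1000.0 * hours * pue
    return energy_kwh * carbon_intensity_kg_per_kwh
```

For instance, a hypothetical 250 W GPU running for 10 hours on a grid emitting 0.4 kg CO2 per kWh would use 2.5 kWh and give an estimate of about 1 kg of CO2.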
